Testing BitBIRCH-Lean

Where I Test Different Clustering Methods

Data Analysis
Version
Coding
RDKit
Small-Molecules
Author

Tony E. Lin

Published

January 29, 2026

Introduction

As the number of known small molecules increases, it becomes difficult to analyze each one individually. Clustering is an important technique that simplifies this: it lets us group similar structures and sample representative structures from each group instead. A typical scenario is selecting diverse structures for further testing. It makes sense to sample a couple of molecules from a cluster rather than test the entire cluster of structurally similar compounds. The trouble starts when chemical libraries grow. With commercial libraries hitting tens of billions of molecules, clustering with common algorithms can take a long time.

Enter BitBIRCH (Published 2025 in Digital Discovery). Researchers claim that BitBIRCH is > 1,000 times faster than Butina clustering for libraries with 1,500,000 molecules. They also show BitBIRCH taking 5 hours to cluster 1 billion molecules. Not too shabby!

I’ve heard about BitBIRCH online, but it is only in the new year that I was able to run some tests. Here is one, where I clustered the chembl-33 natural products subset using BitBIRCH and Butina.

TL;DR

I tested BitBIRCH and Butina clustering on a set of 1,000 molecules. The timings on my machine (M2 Pro) are:

  • BitBIRCH: 0.02 Seconds
  • Butina: 0.11 Seconds

Import Statements

The authors of the BitBIRCH paper provide a nice repository for their package, bblean. It comes with quite a few quality-of-life functions, but I will mainly focus on the clustering aspect. A full breakdown of bblean can be found in their documentation here. And of course, it is pip installable:

pip install bblean

::: {.cell ExecuteTime='{"end_time":"2026-01-28T10:07:34.308046Z","start_time":"2026-01-28T10:07:33.973592Z"}' execution_count=1}
``` {.python .cell-code}
import bblean
from rdkit import DataStructs, Chem
from rdkit.ML.Cluster import Butina
from rdkit.Chem import rdFingerprintGenerator
import time
```
:::

Helper Functions

First, I will create some helper functions. One prints how long a piece of code took to run, in seconds. The others build the distance matrix and run the Butina clustering.

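def total_time(elapsed):
    """Print an elapsed time in seconds, rounded to two decimals."""
    # 'elapsed' avoids shadowing both this function's name and the time module
    print(f"Total Time: {round(elapsed, 2)} Sec")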

def tanimoto_matrix(fp_list):
    """
    Calculate the Tanimoto distance matrix as a flattened lower triangle.
    """
    matrix = []
    for i in range(1, len(fp_list)):
        # Compare the current fingerprint against all the previous ones in the list
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        # Since we need a distance matrix, calculate 1-x for every element in similarity matrix
        matrix.extend([1 - x for x in similarities])
    return matrix

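def cluster_fingerprints(fingerprints, cutoff=0.2):
    """
    Cluster fingerprints with the Butina algorithm.
    :param fingerprints:
        list of RDKit fingerprints (bit vectors).
    :param cutoff:
        distance threshold; fingerprints within this Tanimoto distance are neighbors.
    :return:
        list of clusters (tuples of fingerprint indices), largest first.
    """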
    # matrix
    distance_matrix = tanimoto_matrix(fingerprints)
    # cluster
    clusters = Butina.ClusterData(distance_matrix, len(fingerprints), cutoff, isDistData=True)
    clusters = sorted(clusters, key=len, reverse=True)
    return clusters
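
For reference, `tanimoto_matrix` returns the lower triangle of the distance matrix flattened row by row, which is the layout `Butina.ClusterData` reads when `isDistData=True`. Below is a minimal sketch of how a pair (i, j) maps into that flat list. `condensed_index` is a hypothetical helper used only for illustration, and the toy distances are made up.

```python
def condensed_index(i, j):
    """Position of the (i, j) distance in the flattened lower triangle."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    hi, lo = max(i, j), min(i, j)
    # Row `hi` starts after the 1 + 2 + ... + (hi - 1) entries of earlier rows
    return hi * (hi - 1) // 2 + lo

# Toy 4-item example: rows 1, 2, 3 of the lower triangle laid end to end
flat = [
    0.1,            # d(1,0)
    0.2, 0.3,       # d(2,0), d(2,1)
    0.4, 0.5, 0.6,  # d(3,0), d(3,1), d(3,2)
]

d_31 = flat[condensed_index(3, 1)]
print(d_31)
```

The flat list holds n(n-1)/2 entries for n fingerprints, so it grows quadratically with the size of the set.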

Load Dataset

The testing dataset was taken from the bblean repository. I included it here for this example.

The package comes with an easy way to create molecular fingerprints. Using their default method, we can pack the fingerprints. Packing stores every 8 bits of a typical molecular fingerprint in a single byte, so a 2048-bit fingerprint is compressed into 256 bytes. That can save a lot of memory for large molecular libraries.
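
To make the packing idea concrete, here is a minimal pure-Python sketch of what storing 8 bits per byte means. It is not bblean's implementation: the `pack_bits` helper and the toy fingerprint are illustrative assumptions. NumPy's `packbits` performs the same kind of operation on arrays.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 bits per byte."""
    if len(bits) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        # The first bit of each group becomes the most significant bit
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)

# Toy 2048-bit fingerprint with every 16th bit set
fingerprint = [1 if i % 16 == 0 else 0 for i in range(2048)]
packed = pack_bits(fingerprint)

print(len(fingerprint))  # one list entry per bit
print(len(packed))       # one byte per 8 bits

# The number of on-bits is unchanged by packing
on_bits_packed = sum(bin(b).count("1") for b in packed)
print(sum(fingerprint) == on_bits_packed)
```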

To speed this notebook up, only the first 1,000 molecules are selected for clustering (during testing, I found that the whole set took too long with Butina).

# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("datasets/18-clustering-with-bitbirch/chembl-33-natural-products-subset.smi")

# take 1000 molecules
smiles = smiles[:1000]

# calculate fingerprint
fps_bb = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")

print(f"The number of molecules for clustering: {len(smiles)}")
The number of molecules for clustering: 1000

BitBIRCH Clustering

Following their quickstart tutorial, I ran BitBIRCH clustering with the default settings. On this subset it takes roughly 0.02 seconds.

# record time
start = time.time()

# bitbirch clustering
tree = bblean.BitBirch()
tree.fit(fps_bb)

# end time
end = time.time()
total_time(end - start)
Total Time: 0.02 Sec
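
The start/stop pattern above repeats in every timing cell. One way to tidy it up is a small context manager. This is only a sketch: `timer` is a hypothetical helper, not part of bblean or RDKit. It uses `time.perf_counter`, which is generally better suited than `time.time` to measuring short intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} Sec")

# Stand-in workload; in this notebook it would be tree.fit(fps_bb)
with timer("Toy workload"):
    total = sum(i * i for i in range(100_000))
```

For sub-second timings like these, repeating the measurement a few times gives a more stable number than a single run.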

Butina Clustering

Unfortunately, during my tests, I could not feed the fingerprints calculated with bblean directly into RDKit, so the fingerprints had to be recalculated here. These are also RDKit fingerprints with a bit size of 2048. Remember, these fingerprints are not “packed” like the ones used in BitBIRCH above.

# list of RDKit molecule objects
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

# rdkit fingerprint
fp_gen = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=2048)
rdkit_fps = [fp_gen.GetFingerprint(x) for x in mols]
# record time
start = time.time()

# butina clustering
clusters = cluster_fingerprints(rdkit_fps)

# end time
end = time.time()
total_time(end-start)
Total Time: 0.11 Sec
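
Note that the `cutoff=0.2` default in `cluster_fingerprints` is a distance, so it corresponds to a Tanimoto similarity of 0.8. Here is a toy sketch with sets of on-bit indices standing in for fingerprints. The bit indices are invented for illustration, and `tanimoto_similarity` is a hypothetical helper rather than an RDKit function.

```python
def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

fp_a = {1, 5, 9, 12, 20}
fp_b = {1, 5, 9, 12, 33}

similarity = tanimoto_similarity(fp_a, fp_b)
distance = 1 - similarity
cutoff = 0.2  # same value as the helper's default

print(f"similarity={similarity:.3f}, distance={distance:.3f}")
print("neighbors" if distance <= cutoff else "not neighbors")
```

Raising the cutoff makes clusters larger and looser, while lowering it makes them smaller and tighter.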

BitBIRCH Clustering - Unpacked

While BitBIRCH looks much faster than Butina clustering, remember that the test above used the “pack” parameter, compressing each 2048-bit fingerprint into 256 bytes. To see how this would affect BitBIRCH, I also ran the code on unpacked fingerprints.

# unpacked fingerprints
fps_bb_unpacked = bblean.fps_from_smiles(smiles, pack=False, n_features=2048, kind="rdkit")

# record time
start = time.time()

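# bitbirch clustering on the unpacked fingerprints
# (the time recorded below came from a run that passed the packed fps_bb;
# fit() may also need to be told the input is unpacked, see the bblean docs)
tree = bblean.BitBirch()
tree.fit(fps_bb_unpacked)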

# end time
end = time.time()
total_time(end - start)
Total Time: 0.02 Sec

Conclusion

So the timings for 1,000 molecules are:

  • BitBIRCH: 0.02 Seconds
  • Butina: 0.11 Seconds

That is roughly a 5x speed-up, even on this small set. Originally I was going to cluster the whole set of 64,086 molecules. BitBIRCH clustered it in a couple of seconds. Butina took… well, I lost patience after 30 minutes 🤣.
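
A rough way to see why the full set hurts: the Butina helper above computes every pairwise distance, and the number of pairs grows quadratically with the number of molecules. Here is a quick back-of-the-envelope check. It only counts distance calculations and does not model actual runtimes.

```python
def n_pairs(n):
    """Number of unique pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

subset = n_pairs(1_000)
full = n_pairs(64_086)

print(f"1,000 molecules:  {subset:,} distances")
print(f"64,086 molecules: {full:,} distances")
print(f"ratio: {full / subset:,.0f}x")
```

Going from 1,000 to 64,086 molecules is about 64 times more molecules but several thousand times more distances, all of which the helper also keeps in a Python list.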

BitBIRCH looks like a very handy tool, especially for large molecular libraries. The package also comes equipped with fancy plot functions. Maybe I will explore those in the future.